The news has been called “the first draft of history,” yet news 
publishers have usually left it to libraries and archives to collect, 
organize, and preserve the news they print. Libraries thus serve 
the essential societal role of ensuring that these materials are 
available to future generations of historians to make sense of the 
past. The nonprofit Internet Archive is a new kind of library built 
for the task of collecting the vast and various sources of news and 
providing free access to researchers, historians, scholars, people 
with print disabilities, and the general public via the internet. 
This article will briefly describe the Internet Archive’s efforts to 
collect, preserve, organize, and make available the content of 
newspapers, past and present. 


Our newspaper collections result from varied collection practices, 
including crawling and archiving the web, digitizing physical 
media types such as microfilm and microfiche, and relying on 
digital news materials uploaded by citizen-archivists. The Internet 
Archive began in 1996 by archiving the internet itself, a medium 
that was just beginning to grow in use. As with newspapers, the 
content published on the web was ephemeral, but unlike 
newspapers, no one was saving it. Today we have over twenty-six 
years of web history accessible through the Wayback Machine, our 
historical archive of the World Wide Web, and we work with over a 
thousand libraries and other partners through our subscription 
web-archiving service, Archive-It, to identify important webpages 
to add to our collections. We also work with thousands of global 
partners to save copies of their work into special collections, and 
anyone with a free account may upload to the Internet Archive to 
preserve digital objects for the long term. 
Screenshot of some of the digitized newspapers available in the Internet 
Archive. 


While some access to information is easier than ever, the issues of 
copyright, licenses, paywalls, search, and format obsolescence 
contribute to the challenges our librarians and researchers face 
when working with such varied sources of news. Fortunately, new 
computational tools are making petabytes of archived video, audio, 
online text, and other digitized analog materials much more 
accessible for search and visualizations. If this work is done well, 
researchers and the Internet Archive can surface trends and stories 
that help explain and illuminate our past. 


What materials exactly constitute “the news” in our internet era is 
a complicated question. Gone are the days of having one or two 
newspapers of record for each major metropolitan area. The 
Internet Archive has taken a broad view: digitizing newspapers and 
magazines, archiving worldwide news websites, collecting off-air 
TV broadcasts from US television news programs, recording 
millions of hours of US radio broadcasts, as well as recording many 
podcasts and social media sources. 


Web collections of news sites are being actively built by 
hundreds of organizations using the Internet Archive’s Archive-It 
tool, which then makes the web collections searchable and part of 
the Wayback Machine. For example, the Community Webs 
collections have been gathered and curated by over 150 public 
libraries and other cultural heritage organizations to record and 
preserve the stories of their own communities. These collections 
are organized by topic, such as “2020 Anti-racism Protests,” 
“Brooklyn Politics,” and “Catholic Church Sex Abuse Scandal,” as 
well as by the organization doing the collecting; the collections can 
also be full-text searched. These locally focused web collections 
play an important part in diversifying the historical record and 
preserving the voices of those often excluded from traditional 
archives. For example, the Grand Rapids Public Library has 
digitally collected and preserved El Vocero Hispano, the largest 
Hispanic newspaper in West Michigan. This newspaper serves a 
relatively small, local community, and it reports on issues tailored 
to that community’s interests. As such, it shines a light on local news, politics, 
events, and culture through the lens of this particular Hispanic 
community. It can also provide a very different lens through which 
to understand the effects of national and international events, like 
the COVID-19 pandemic, on this particular community. 


The Internet Archive also broadly crawls worldwide news sites, 
often revisiting them daily, and adds the captures to the Wayback 
Machine along with the other gathered materials. News 
organizations, especially those that are independent from 
government control, are often at risk of being shut down alongside 
efforts to prevent access to their archives. This has been true in 
many countries and cities in recent years, including in Turkey, 
Russia (especially since the war against Ukraine started in 
February of 2022), Taiwan, Hong Kong, Iran, and others. The 
Internet Archive has worked with news organizations and media 
rights groups such as Reporters Without Borders to help preserve 
the archives of these sites as well as make them available via the 
public internet. Two recent examples of that work are the 
full-text searchable indexes of the news sites Stand News and Apple 
Daily, both of which were shut down by the Chinese government. 
In many cases, these preserved collections are the only way the 
journalists themselves can access articles they have written. 


While collecting web-based news sites is proceeding at a large 
scale, it is becoming more difficult as paywalls, licensing, and 
technological measures are used to block crawlers, including 
those run by libraries. Libraries must therefore take a varied approach to 
collecting and preserving news. The Internet Archive is starting to 
digitize newspapers and to collect digitized newspapers from 
around the globe. The Newspapers collection now holds over 1.7 
million items dating from the early 1900s through to the early 
1960s, and an additional 50,000 items have been contributed by 
the community in a companion collection that contains mostly 
foreign-language newspapers and related materials. Some 
newspapers in this collection have been scanned from donated 
microfilm and microfiche, creating a digital image of each page 
that is then processed into searchable issues. These digital 
collections support searching by metadata as well as by full 
text. For example, the search term “nationalism” returns over 
34,000 results, which can further be sorted and filtered by date, 
publisher, language, and so on. 
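
As a rough illustration of how a term can be combined with metadata filters, the sketch below assembles a query URL without sending any request. It assumes the public archive.org advanced search endpoint and Lucene-style query syntax; the field names (collection, language, date), the "newspapers" collection identifier, and the helper names build_query and build_search_url are illustrative assumptions, not details taken from this article. That endpoint queries item metadata, so this sketch is not expected to reproduce the full-text result count quoted above.

```python
from urllib.parse import urlencode

# Assumption: archive.org's public advanced search endpoint, which accepts
# Lucene-style queries. Field names and the collection identifier used below
# are illustrative and should be checked against the live service.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_query(term, collection, language=None, start_year=None, end_year=None):
    """Combine a search term with optional metadata filters into one query string."""
    clauses = [f"({term})", f"collection:({collection})"]
    if language:
        clauses.append(f"language:({language})")
    if start_year is not None and end_year is not None:
        # Inclusive date range covering whole calendar years.
        clauses.append(f"date:[{start_year}-01-01 TO {end_year}-12-31]")
    return " AND ".join(clauses)


def build_search_url(query, fields=("identifier", "title", "date"), rows=50):
    """Encode the query and the requested result fields as a URL; no request is sent."""
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]
    params += [("rows", str(rows)), ("output", "json")]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


query = build_query("nationalism", "newspapers", start_year=1900, end_year=1960)
url = build_search_url(query)
print(query)
print(url)
```

Keeping the query construction separate from the URL encoding makes it easy to add or drop a filter (for example, language) and see how the result set would be narrowed before fetching anything.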


The newspaper collection includes many smaller, independent 
local news outlets. One example is the Bay Area Reporter, the oldest 
continuously published lesbian, gay, bisexual, transgender, and 
queer weekly newspaper in the United States and the 
most-circulated publication serving the LGBTQ communities of the San 
Francisco Bay Area. Another example is the Boston Phoenix, which 
was the city’s largest alternative weekly, covering local politics, 
arts, and culture from 1973 to 2013, when it folded. Newspapers 
like these often cover local events and subjects not covered by 
national outlets. Digitally preserving these newspapers adds both 
depth and breadth to the historical record for historians and 
researchers. Moreover, these digitized collections show the 
image of the actual newspaper. This is helpful for providing 
additional context, such as advertisements, images, and the 
section or page a story ran on, which could indicate its importance 
at the time. 
Researchers can also work with the Internet Archive to do text and data mining projects, which can 
accomplish completely different types of analyses on these 
collections. A Global Database of Society project by Kalev H. 
Leetaru provides useful search and trend tools used by journalists 
leveraging the collections in the Internet Archive. And in order to 
evaluate the health of local media ecosystems as part of the News 
Measures Research Project, other researchers partnered with the 
Internet Archive to crawl and archive the homepages of 663 local 
news websites representing 100 communities across the United 
States. This collection allowed researchers to do a number of 
meta-analyses across these local news sources; for example, it 
allowed them to compare how different local communities cover 
core topics such as emergencies, politics, and transportation. The 
Internet Archive hopes to support many more research projects 
that leverage computational tools on top of these collections. 


For twenty-six years, the Internet Archive has been building news 
and other collections that are freely available to researchers and 
the general public. While these collections are already useful to many, 
challenges remain that have slowed progress and accessibility. The 
researcher community can help by using the materials, uploading 
materials, providing feedback to the Internet Archive, and 
publicizing the utility of online resources to overcome our 
challenges. Libraries live to serve our patrons; please let us 
know how we can do that better. 